{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "```\n",
        "# Copyright 2022 Google Inc.\n",
        "\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "\n",
        "#     http://www.apache.org/licenses/LICENSE-2.0\n",
        "\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License.\n",
        "```"
      ],
      "metadata": {
        "id": "Nc5z20tucbYc"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
	"This colab supports UniProt 2022_04, where Google predicted protein names for 88% of all Uncharacterized proteins (over 1 in 5 proteins in the database).",
	"\n",
	"\n",
	"You can run this file to check whether any prediction produced by Google's systems is supported by other sources. Note that some of the predicted names have since been updated in UniProt 2022_05.",
        "\n",
	"\n",
        "**Paste in the UniProt accession of the protein below!**"
      ],
      "metadata": {
        "id": "NrIKMuHM_pS2"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Imports and dependencies"
      ],
      "metadata": {
        "id": "yvGAanHAsf04"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!apt-get install pigz"
      ],
      "metadata": {
        "id": "o43XlscFsCIZ",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "0b469c80-b410-460c-ebf3-56170b56198c"
      },
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Reading package lists... Done\n",
            "Building dependency tree       \n",
            "Reading state information... Done\n",
            "The following package was automatically installed and is no longer required:\n",
            "  libnvidia-common-460\n",
            "Use 'apt autoremove' to remove it.\n",
            "The following NEW packages will be installed:\n",
            "  pigz\n",
            "0 upgraded, 1 newly installed, 0 to remove and 12 not upgraded.\n",
            "Need to get 57.4 kB of archives.\n",
            "After this operation, 259 kB of additional disk space will be used.\n",
            "Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 pigz amd64 2.4-1 [57.4 kB]\n",
            "Fetched 57.4 kB in 1s (99.5 kB/s)\n",
            "Selecting previously unselected package pigz.\n",
            "(Reading database ... 123934 files and directories currently installed.)\n",
            "Preparing to unpack .../archives/pigz_2.4-1_amd64.deb ...\n",
            "Unpacking pigz (2.4-1) ...\n",
            "Setting up pigz (2.4-1) ...\n",
            "Processing triggers for man-db (2.8.3-2ubuntu0.1) ...\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install binary_file_search"
      ],
      "metadata": {
        "id": "YhaVGrgU98Qe",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "b06bc114-bbfd-4b33-98b7-6c93fb1aa2c8"
      },
      "execution_count": 2,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Collecting binary_file_search\n",
            "  Downloading binary_file_search-0.7.tar.gz (4.0 kB)\n",
            "Building wheels for collected packages: binary-file-search\n",
            "  Building wheel for binary-file-search (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "  Created wheel for binary-file-search: filename=binary_file_search-0.7-py3-none-any.whl size=3970 sha256=ad37736153f21eae44011a843bbd7814e1c66c7a8fb442460426e61c6f9694cb\n",
            "  Stored in directory: /root/.cache/pip/wheels/7c/a9/91/fed3d9ba96b88a121ddb4dc5198f12e14119c4e59afe26fd8f\n",
            "Successfully built binary-file-search\n",
            "Installing collected packages: binary-file-search\n",
            "Successfully installed binary-file-search-0.7\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "ymlaxuFFWFlE"
      },
      "outputs": [],
      "source": [
        "from binary_file_search.BinaryFileSearch import BinaryFileSearch\n",
        "import IPython.display\n",
        "def print_markdown(string):\n",
        "    IPython.display.display(IPython.display.Markdown(string))"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Download evidence file and unzip\n",
        "(takes a few minutes)"
      ],
      "metadata": {
        "id": "2NbvPJy2sjFf"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!wget -c -O sorted_evidencer.csv.gz https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/evidencer_sorted.csv.gz"
      ],
      "metadata": {
        "id": "DC0CFu7m9tDY",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "2145e933-2f58-4357-aae3-cdba226b08d8"
      },
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2022-10-11 18:00:36--  https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/evidencer_sorted.csv.gz\n",
            "Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.195.128, 172.253.117.128, 142.250.99.128, ...\n",
            "Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.195.128|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 2358963308 (2.2G) [application/octet-stream]\n",
            "Saving to: ‘sorted_evidencer.csv.gz’\n",
            "\n",
            "sorted_evidencer.cs 100%[===================>]   2.20G  67.9MB/s    in 31s     \n",
            "\n",
            "2022-10-11 18:01:08 (72.2 MB/s) - ‘sorted_evidencer.csv.gz’ saved [2358963308/2358963308]\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!unpigz -fk sorted_evidencer.csv.gz"
      ],
      "metadata": {
        "id": "g7crO7tz919X"
      },
      "execution_count": 5,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Search for your evidence"
      ],
      "metadata": {
        "id": "cU4GrPdE_SfH"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "accession = \"A0A009\" #@param {type:\"string\"}\n",
        "with BinaryFileSearch('sorted_evidencer.csv', sep=\",\", string_mode=True) as bfs:\n",
        "    lines = bfs.search(accession)\n",
        "\n",
        "if len(lines) == 0:\n",
        "  raise ValueError('Sorry, this protein\\'s accession wasn\\'t found in our database. Maybe check your spelling, or maybe it\\'s a very new protein?')\n",
        "if len(lines) > 1:\n",
        "  for l in lines:\n",
        "    print(l)\n",
        "  raise ValueError('There was some sort of error - we found multiple predictions for this protein!', lines)\n",
        "\n",
        "accession, prediction, support_in_uniprot, support_in_uniref, support_in_uniref_example = tuple(lines[0])\n",
        "\n",
        "if support_in_uniprot and support_in_uniprot != \"NULL\":\n",
        "  print_markdown(f\"The prediction **{prediction}** for **{accession}** appears as a substring of the **UniProt page** for **{accession}**.\\n\")\n",
        "  print_markdown(f\"More info at https://www.uniprot.org/uniprotkb/{accession}/entry\")\n",
        "elif support_in_uniref and support_in_uniref != \"NULL\":\n",
        "  print_markdown(f\"The prediction **{prediction}** for **{accession}** appears as a **substring of UniRef50 cluster** member {support_in_uniref_example}.\\n\")\n",
        "  print_markdown(f\"More info on on https://www.uniprot.org/uniprotkb/{support_in_uniref_example}/entry\")\n",
        "else:\n",
        "  print_markdown(f\"The prediction **{prediction}** for **{accession}** has **no support** found with these automated methods. Perhaps the protein is very new?\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 70
        },
        "cellView": "form",
        "id": "N17c3oC39_0e",
        "outputId": "8a012ecd-7e93-4d88-f026-0fbb8d6919a7"
      },
      "execution_count": 6,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ],
            "text/markdown": "The prediction **Carbamoyltransferase** for **A0A009** appears as a **substring of UniRef50 cluster** member D6A7G3.\n"
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ],
            "text/markdown": "More info on on https://www.uniprot.org/uniprotkb/D6A7G3/entry"
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# License of downloaded data\n",
        "\n",
        "The evidencer file is licensed CC-BY 4.0 and is built based on UniProt.\n",
        "\n",
        "\"UniProt: the universal protein knowledgebase in 2021.\" Nucleic acids research 49, no. D1 (2021): D480-D489."
      ],
      "metadata": {
        "id": "jnCOhzbTJ1RM"
      }
    }
  ]
}
